home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Celestin Apprentice 5
/
Apprentice-Release5.iso
/
Source Code
/
PowerPlant
/
DMultiStringLocator
/
About DMultiStringLocator
next >
Wrap
Text File
|
1996-07-06
|
5KB
|
113 lines
About DMultiStringLocator
=========================
Written July 6, 1996 by Eric Gundrum <eric@teamworks.com>
Preface
-------
This document describes the DMultiStringLocator class. It contains these
sections:
- What is DMultiStringLocator
- Implementation Notes
- About the Sample Code
- Use Restrictions
What is DMultiStringLocator
---------------------------
DMultiStringLocator is a C++ class to provide fast searching for one or
more strings. It is an implementation of a table-based algorithm
credited to Aho and Corasick and described in "Practical Algorithms for
Programmers" by Andrew Binstock and John Rex (published in 1995 by
Addison Wesley).
DMultiStringLocator is designed to search for more than one string at a
time. Searching for multiple strings works nearly as fast as searching
for a single string. Using a table-lookup mechanism eliminates any
unnecessary backtracking when a partial match fails, accounting for much
of the speed.
This algorithm may not be the fastest available, but it should be
sufficient for most practical problems. It is general enough to be used
when searching for a single string, as well as for more than one in the
same target text.
Implementation Notes
--------------------
The current implementation expects a list of DSearchString objects when
the DMultiStringLocator object is instantiated. Internal state tables
are built from this list in the constructor. (This allocates memory;
about eight times the combined length of the search strings and at least
two times the alphabet size.) Changing the list requires constructing a
new search object (and its state tables).
A DSearchString object, based on PowerPlant's LStr255 class, contains a
ReportFound() virtual member function that must be overridden in a
derived class. ReportFound() is executed for each found string. (This is
a performance bottleneck; keep ReportFound() short and fast.) A
parameter to ReportFound() is the position in the string of the last
character of the found string. Use this value to locate the found
strings after the search phase.
A search is initiated with a call to SearchBuffer(), which will search a
RAM-based buffer (by walking a pointer through the buffer). Searching
more data than will fit in one buffer can be accomplished with repeated
calls to SearchBuffer(), but you must deal with a match string that is
split over two buffers.
To help handle strings split across the buffer, SearchBuffer() returns
the maximum number of characters which might belong to a string found at
the end of the current buffer. These characters should be included in
the next buffer. However, if these characters also include other
complete strings, those strings will be found twice.
For example, consider searching for 'foobarooba' and 'bar', where the
first buffer ends with 'foobaroo'. SearchBuffer() will return 8,
indicating that the second buffer should begin 8 characters before the
end of the first. However, such as search will find 'bar' twice. This is
a design flaw I have not yet resolved. (It is not significant to my
use.) I suggest working around it by looking for multiple hits on the
same string in a post process phase (by tracking the hit position). Or,
handle it in ReportFound() if performance is not an issue.
Another limitation of this implementation is that it searches for a byte
sequence. It does not support case-insensitive searches. However, that
is easily remedied by changing the case of the search strings and input
text.
DMultiStringLocator currently relies on a few PowerPlant classes,
including LList, LListIterator, and LString. With a bit of effort, these
classes could be replaced with STL classes. I'll leave that for a future
project.
About the Sample Code
---------------------
The sample code consists of the CSearchWindow class to demonstrate the
search engine and a supporting PowerPlant application (PPShell).
CSearchWindow builds the search list from user input (in DoSearch()). It
then reads the target file into a buffer and performs a search on that
buffer (in SearchFile()). If the file contains more data, the process is
repeated until all is read. Note that SearchFile() will overlay the
buffer ends to a deal with match strings split over two buffers.
CSearchString is an important support class used by CSearchWindow.
Derived from DSearchString, CSearchString demonstrates how to minimally
collect the results of a search with the ReportFound() function. Here
ReportFound() is used to count the number of hits. It also tracks the
position of found strings to prevent reporting them twice.
Use Restrictions
----------------
I place no restrictions on how this code may be used or distributed. I
ask only that my copyright statement be retained in the source files of
all derivative works. (I put a lot of work into this class and I want
people to know who did it.)
If you have comments or want to report bugs, send mail to
<eric@teamworks.com>.